Towards Energy-Proportional Computing Using
Abstract: Massive data centers housing thousands of computing nodes have become commonplace in enterprise computing, and the
power consumption of such data centers is growing at an unprecedented rate. Adding to the problem is the inability of the servers to exhibit
energy proportionality, i.e., provide energy-efficient execution under all levels of utilization, which diminishes the overall energy efficiency of 
the data center. It is imperative that we realize effective strategies to control the power consumption of the server and improve the energy 
efficiency of data centers. With the advent of Intel Sandy Bridge processors, we have the ability to specify a limit on power consumption 
during runtime, which creates opportunities to design new power-management techniques for enterprise workloads and make the systems 
that they run on more energy-proportional. 

In this paper, we investigate whether it is possible to achieve energy proportionality for enterprise-class server workloads, namely 
SPECpower_ssj2008 and SPECweb2009 benchmarks, by using Intel’s Running Average Power Limit (RAPL) interfaces. First, we analyze
the average power consumption of the full system as well as the subsystems and describe the energy proportionality of these components.
We then characterize the instantaneous power profile of these benchmarks within different subsystems using the on-chip energy meters
exposed via the RAPL interfaces. Finally, we present the effects of power limiting on the energy proportionality, performance, power and
energy efficiency of enterprise-class server workloads. Our observations and results shed light on the efficacy of the RAPL interfaces and
provide guidance for designing power-management techniques for enterprise-class workloads.

Index Terms: Power Limiting, Energy Proportionality, RAPL, Enterprise Computing, SPEC
1 Introduction

Massive data centers, which house thousands of computing
nodes, have become increasingly common. A large
fraction of such data centers' total cost of ownership
(TCO) comes from the cost of building and maintaining
infrastructure that is capable of powering such large-scale
data centers and from the recurring energy costs.
Consequently, power and energy have emerged as first-order
design constraints in data centers. These issues are
further magnified by the inability of servers to provide
energy-efficient execution at all levels of utilization (i.e.,
load-levels).

Figure 1 shows the power consumption of a compute
server running SPECpower under different load-levels and
the hypothetical linear and ideal (i.e., energy-proportional)
non-peak power curves. As evident from the figure, there
is room to improve the non-peak power efficiency of
the server with respect to both the ideal and the
linear power curves. The recent recommendation of energy
proportionality in servers, i.e., to design servers that consume
power proportional to the utilization, is a move in the
right direction as it has the potential to double the
energy efficiency of servers. However, achieving energy-proportional
operation is a challenging task, particularly
given that typical servers consume 35-45% of peak power,
even when idling.

Fig. 1. Illustration of SPECpower Energy Proportionality

Typically, dynamic voltage and frequency scaling (DVFS)
has been used to achieve better energy efficiency as it can
potentially give up to cubic energy savings.
However, as we will show in this paper, the subsystem
affected by DVFS (i.e., the core¹) is already the most
energy-proportional part of the system. There are other
subsystems, such as the uncore², that consume constant
power, irrespective of the system utilization. In order to
achieve energy proportionality, we need to understand the
power consumption of each subsystem at different levels
of utilization and to leverage mechanisms that enable us
to control the power consumption of these subsystems.

With the advent of Intel Sandy Bridge processors, we
have better control over the power consumption of the
system via the Running Average Power Limit (RAPL)
interfaces [10]. RAPL exposes on-chip energy meters
for the core subsystem, processor package, and DRAM
and enables the tracking of power consumption at a time
resolution (~1 ms) and system-level granularity that was
not possible before. Moreover, it facilitates deterministic
control over the power consumption of subsystems through
power limiting interfaces. These interfaces allow a user to
specify a power bound and a time window over which
the bound should be maintained. While this hardware-enforced
power limiting is an appealing option, the impact
of power limiting on the performance, power, and energy
efficiency of enterprise-class server workloads is still not
well understood and remains an active area of research.

1. The core subsystem includes components such as the ALUs, FPUs,
L1, and L2 caches.

2. The uncore subsystem includes components such as the memory
controller, integrated I/O, and coherence engine.

In this paper, we investigate whether it is possible
to achieve energy-proportional operation for enterprise-class
server workloads, namely the SPECpower_ssj2008
and SPECweb2009 benchmarks (henceforth referred to as
SPECpower and SPECweb, respectively), by using the RAPL
interfaces. To this end, this paper makes the following
contributions: (i) insights into the mechanisms of power
management for enterprise-class server workloads using
the RAPL interfaces via an analysis of the SPECpower and
SPECweb benchmarks by calibrating their input parameters,
(ii) a rigorous quantification of the energy proportionality
of each subsystem within a server node via an analysis of
power consumption profiles of the different subsystems
when running SPECpower and SPECweb at different
load-levels, (iii) an analysis and characterization of the
instantaneous-power profiles at different load-levels of
SPECpower and SPECweb to understand whether power
limiting will enable us to improve the energy efficiency of
these benchmarks, and (iv) empirical results on the impact
of RAPL power limiting on average power, performance,
instantaneous power, and energy efficiency.

Through our contributions, we make the following
observations and conclusions on the power management
of the SPECpower and SPECweb benchmarks using RAPL
interfaces:

• The core is the most energy-proportional subsystem and
the uncore is the least.

• Better power management mechanisms are required to
achieve energy proportionality at the uncore subsystem-level.

• There is ample opportunity for limiting the power consumption
of processor package and memory subsystems.

• Power limiting at the level of the core subsystem is the
best option for improving energy efficiency and achieving
energy proportionality.

• Though we were not able to achieve energy proportionality
at the full system level, i.e., entire compute
node, we show that energy-proportional operation or
better is possible at the granularity of subsystems over
which we have control via RAPL power limiting (i.e.,
core subsystem, processor package, and DRAM).


The rest of the paper is organized as follows. In
Section 2, we present the details of the SPECpower
and SPECweb benchmarks and Intel RAPL interfaces.
Section 4 describes our analysis and characterization of
average power consumption. It presents details on the
energy proportionality of the full system as well as the subsystems.
Section 5 details the instantaneous power profile of all
subsystems at different load-levels in SPECpower and
SPECweb and the observations from these experiments.
Next, in Section 6, we limit the power consumption of
SPECpower and SPECweb to study its impact on
the power, performance, and energy efficiency of these
benchmarks. In Section 7, we describe the related work,
and we conclude in Section 8.


2 Background

In this section, we provide an overview of the SPECpower
benchmark and its design as well as details on its
configurable parameters. We then present the control and
capabilities exposed by Intel's RAPL interfaces.


2.1 Overview of SPECpower Benchmark

SPECpower [18] is an industry-standard benchmark that
measures both the power and performance of a server
node. The benchmark mimics a server-side Java transaction
processing application. It stresses the CPU, caches, and
memory hierarchy and tests the implementations of the
Java virtual machine (JVM), just-in-time (JIT) compiler,
garbage collection, and threads. The benchmark requires
two systems: (1) the system under test or SUT and (2) the
control and collection system (CCS) with communication
between the systems established via Ethernet³. The SUT
runs the workload and is connected to a power meter. The
power meter, in turn, is connected to the CCS. The CCS
collects the performance and power data passed to it by
the SUT and power meter, respectively.

3. SUT and CCS can be the same system. Communication is established
via Ethernet only if the systems are different.

The SPECpower benchmark is designed to produce
consistent and repeatable performance and power measurements.
It executes different types of transactions, which
are grouped together in batches for scheduling
purposes. Each load-level is achieved by controlling the delay
between the arrival of batches.

More specifically, the SPECpower benchmark is a graduated
workload, i.e., it runs the workload at different
load-levels and reports the power and performance at each
load-level. The benchmark starts with a calibration phase,
which determines the maximum throughput. The calibrated
throughput is set as the throughput target for 100% load-level.
The throughput target for the rest of the load-levels
is calculated as a percentage of the throughput target for
100% load-level. For example, if the throughput target for
100% load-level is 100,000, then the target for 70% load-level
is 70,000, 40% is 40,000, and so on. The throughput
is measured in server-side Java operations per second
(ssj_ops).
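The graduated load-level targets described for SPECpower can be illustrated with a short sketch. The function name `load_level_targets` and the choice of ten load-levels in 10% steps are assumptions made for illustration; only the rule that each target is a percentage of the calibrated 100% throughput comes from the text.

```python
def load_level_targets(calibrated_ssj_ops, levels=range(100, 0, -10)):
    """Map each load-level (percent) to its throughput target in ssj_ops.

    `calibrated_ssj_ops` is the throughput target for the 100% load-level.
    The default set of levels (100% down to 10%) is an illustrative assumption.
    """
    # Integer arithmetic keeps the targets exact for whole-number inputs.
    return {level: calibrated_ssj_ops * level // 100 for level in levels}


targets = load_level_targets(100_000)
print(targets[70], targets[40])
```

With a calibrated throughput of 100,000 ssj_ops, the 70% and 40% entries reproduce the 70,000 and 40,000 targets from the example in the text.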
The benchmark supports a set of configurable parameters⁴.
For example, the maximum target throughput and
the batch size can be manually configured. We refer the
reader to [20] for further information on configurable
parameters. The flexibility, coupled with the consistency
and repeatability of SPECpower, allows us to evaluate the
applicability of newer power-management interfaces, such
as RAPL, to enterprise-class server workloads.


2.2 Overview of SPECweb Benchmark 

SPECweb is an industry-standard benchmark for measuring
front-end web server performance. It allows the user
to measure performance based on the request handling
capability and response time maintained by a server node.
The benchmark consists of four different components:

1) Client: It runs the application program, which sends
HTTP requests to the server and receives the corresponding
response from the server.

2) Web Server: It handles the requests issued by the
client. This is also the system under test (SUT) for
this benchmark.

3) Back-End Simulator (BeSim): It emulates a back-end
application server. The web server communicates with
BeSim to retrieve specific information required to
complete a request from one of the clients.

4) Prime Client: It initiates and controls the clients and
also initializes the web server and BeSim. It collects
performance results for the benchmark.

The main performance and power metrics for the benchmark
are simultaneous user sessions (SUS) and SUS/watt,
respectively. In addition to SUS, the SPECweb benchmark
adds two response time performance metrics,
namely TIME_GOOD and TIME_TOLERABLE. By default,
95% and 99% of the requests should have response times
less than TIME_GOOD and TIME_TOLERABLE, respectively.
Similar to SPECpower, we can control the benchmark
parameters to execute the benchmark at different load-levels,
i.e., different SUS. This benchmark also allows us to tweak a
set of input parameters. We refer the reader to the SPECweb
documentation for a full list of configurable parameters.


2.3 Intel’s Running Average Power Limit (RAPL) Interfaces

RAPL was introduced in Intel Sandy Bridge processors. The
RAPL interfaces provide mechanisms to enforce power
consumption limits on a specific subsystem. The only
official documentation available for these interfaces is
section 14.7 of the Intel software developer's manual [10].
Our experiments deal only with the Sandy Bridge server
platforms.

The RAPL interfaces can be programmed using the
model-specific registers (MSRs). MSRs are used for performance
monitoring and controlling hardware functions.
These registers can be accessed using two instructions:
(1) rdmsr, short for "read model-specific registers" and
(2) wrmsr, short for "write model-specific registers." The
msr kernel module can be used for accessing MSRs from
user space in Linux environments. When loaded, the msr
module exposes a file interface at /dev/cpu/x/msr, where x
is the CPU number. This file
interface can be used to read from or write to any MSR
on that CPU.
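As a minimal sketch of how the msr file interface can be used from user space, the following Python function reads one 64-bit register by seeking to the register address and reading eight bytes. The function name `read_msr` and the `path_template` parameter are hypothetical helpers introduced here for illustration; the sketch assumes a Linux system with the msr module loaded and sufficient privileges to open the device file.

```python
import os
import struct


def read_msr(register, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Return the 64-bit value of one MSR read through the msr file interface.

    `path_template` is a hypothetical parameter for this sketch; it defaults to
    the device path exposed by the Linux msr module.
    """
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The file offset selects the register; each MSR is 8 bytes wide.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read for MSR {register:#x}")
    # MSR contents are returned as a little-endian unsigned 64-bit integer.
    return struct.unpack("<Q", raw)[0]
```

For example, `read_msr(0x606)` would return the raw contents of MSR_RAPL_POWER_UNIT on CPU 0, assuming the address 0x606 listed for that register in the Intel manual.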

According to the Intel documentation, RAPL interfaces
operate at the granularity of a processor socket. The
server platforms provide control over three domains (i.e.,
subsystems)⁵: (1) package (PKG), (2) power plane 0 (PP0),
and (3) DRAM. PKG, PP0, and DRAM represent the
processor package (or socket), the core subsystem, and
memory DIMMs associated with that socket, respectively.
The MSR_RAPL_POWER_UNIT register contains the units
for specifying time, power, and energy, and the values
are architecture-specific. For example, our testbed requires
and reports time, power, and energy at increments of
976 microseconds, 0.125 watts, and 15.3 microjoules,
respectively. Each domain consists of its own set of RAPL
MSR interfaces. On a server platform, RAPL exposes four
capabilities:

1) Power limiting - Interface to enforce limits on power
consumption.

2) Energy metering - Interface reporting actual energy
usage information.

3) Performance status - Interface reporting performance
impact due to power limit.

4) Power information - Interface which provides value
range for control attributes associated with power
limiting.
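The unit increments quoted above for MSR_RAPL_POWER_UNIT (976 microseconds, 0.125 watts, 15.3 microjoules) can be related to the raw register contents with a small decoding sketch. The bit-field positions used here (power in bits 3:0, energy in bits 12:8, time in bits 19:16, each an exponent of 1/2) are an assumption based on the register layout in the Intel manual, and the raw value 0xA1003 is an illustrative input, not a reading taken from the paper's testbed.

```python
def decode_rapl_units(raw):
    """Split a raw MSR_RAPL_POWER_UNIT value into power, energy, and time units."""
    power_exp = raw & 0xF            # bits 3:0 (assumed layout)
    energy_exp = (raw >> 8) & 0x1F   # bits 12:8 (assumed layout)
    time_exp = (raw >> 16) & 0xF     # bits 19:16 (assumed layout)
    # Each unit is 1 / 2**exponent of a watt, joule, or second.
    return {
        "power_watts": 1.0 / (1 << power_exp),
        "energy_joules": 1.0 / (1 << energy_exp),
        "time_seconds": 1.0 / (1 << time_exp),
    }


units = decode_rapl_units(0x000A1003)  # illustrative raw value
print(units)
```

Under these assumptions the decoded units are 0.125 watts, roughly 15.3 microjoules, and roughly 976 microseconds, which agrees with the increments reported for the testbed.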

2.3.1 Power Limiting 

RAPL maintains an average power limit over a sliding window
instead of enforcing strict limits on the instantaneous
power. The advantage of having an average power limit
is that if the average performance requirement is within
the specified power limits, the workload will not incur
any performance degradation even if the performance
requirement well exceeds the power limit over short bursts
of time. The user has to provide a power bound and a
time window in which the limit has to be maintained.
Each RAPL domain exposes an MSR which is used for
programming these values. The PKG domain provides two
power limits and associated time windows for finer control
over the workload performance, whereas other domains
provide only one power limit. The interface provides a
clamping ability, which when enabled, allows the processor
to go below an OS-requested P-state.
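The difference between an average power limit over a sliding window and a strict instantaneous limit can be illustrated with a toy calculation. This is only a sketch of the idea: the simple arithmetic mean over the last `window` samples and the function name `running_average_violations` are assumptions, and the hardware's actual averaging algorithm is not described in the text.

```python
from collections import deque


def running_average_violations(samples_watts, window, limit_watts):
    """Return the sample indices at which the sliding-window mean exceeds the limit."""
    recent = deque(maxlen=window)
    violations = []
    for i, power in enumerate(samples_watts):
        recent.append(power)
        # Only judge the limit once a full window of samples is available.
        if len(recent) == window and sum(recent) / window > limit_watts:
            violations.append(i)
    return violations


# Hypothetical power trace with short bursts above an 80 W bound.
trace = [60, 60, 120, 60, 60, 60, 120, 60]
print(running_average_violations(trace, window=4, limit_watts=80))
print(running_average_violations(trace, window=1, limit_watts=80))
```

With a four-sample window every window mean in this trace is 75 W, so the bursts to 120 W do not violate the 80 W average limit; with a window of one sample (an instantaneous limit) the two bursts are flagged.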

2.3.2 Energy Metering 

Each domain exposes an MSR interface that reports the
energy consumed by that domain. On a server platform,

(1) energy(PKG) = energy consumed by the processor package,

(2) energy(PP0) = energy consumed by the core subsystem, and

(3) energy(uncore subsystem) = energy(PKG) - energy(PP0).
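The energy relations above can be turned into power figures with a few lines of arithmetic. The sketch below assumes the energy counters are 32-bit values that wrap around (an assumption about the register format, not stated in the text) and that readings are scaled by the energy unit from MSR_RAPL_POWER_UNIT; all function names are hypothetical.

```python
COUNTER_BITS = 32  # assumed width of the energy counter field


def energy_joules(start_raw, end_raw, energy_unit_joules):
    """Energy consumed between two raw counter readings, allowing one wraparound."""
    ticks = (end_raw - start_raw) % (1 << COUNTER_BITS)
    return ticks * energy_unit_joules


def average_power_watts(start_raw, end_raw, energy_unit_joules, seconds):
    """Average power = (final energy reading - initial energy reading) / time."""
    return energy_joules(start_raw, end_raw, energy_unit_joules) / seconds


def uncore_energy_joules(pkg_joules, pp0_joules):
    """energy(uncore subsystem) = energy(PKG) - energy(PP0)."""
    return pkg_joules - pp0_joules
```

The modulo in `energy_joules` handles a single counter wraparound between readings, so the sampling interval has to be short enough that the counter cannot wrap more than once.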


4. Only a subset of these parameters can be changed for compliant 
runs. 


5. Note: We use RAPL domain and subsystem interchangeably in the rest
of the paper.
TABLE 1

Per-Socket Parameter Range (MTW = Maximum Time
Window, MaxP = Maximum Power, MinP = Minimum Power).
Multiply by 2 for Full Two-Socket System.

Domain/Range    MTW         MaxP         MinP
Package         45.89 ms    180 watts    51 watts
DRAM            39.06 ms    75 watts     15 watts


2.3.3 Performance Status 

This MSR interface reports the total time for which each 
domain was throttled (i.e., functioning below the OS- 
requested P-state) due to the enforced power limit. This 
information will be useful in understanding the effects of 
power limiting on a particular workload. 

2.3.4 Power Information 

The PKG and DRAM domains expose an MSR interface
that provides information on the ranges of values that can
be specified for a particular RAPL domain for limiting its
power consumption. This includes maximum time window,
maximum power, and minimum power. The range of per-socket
values on our experimental platform is given in
Table 1.

3 Experimental Setup 

The SUT for our experiments is an Intel Xeon E5-2665
processor (Intel Romley-EP). The node has two such
processors for a total of 16 physical cores, or 32 logical cores when
hyper-threading is enabled. It has 256 GB of memory and
runs Linux kernel version 3.2.0. We used a Yokogawa
WT210 power meter for full system power measurements.

3.1 Setup for SPECpower 

The CCS has an Intel Xeon E5405 processor with dual quad
cores and 8 GB of RAM. The CCS runs Linux kernel
version 2.6.32. The CCS and SUT were connected through
a gigabit Ethernet network. We used all the cores in the SUT
for our experiments. Eight JVMs with four threads for each
JVM were used as the configuration for SPECpower. The
four threads in each JVM were pinned to two adjacent
physical cores on the SUT using numactl. To further enhance
the performance of the SUT, we enabled large page memory
(HugeTLB) support and set aside 32 GB for huge page
allocation. Note that HugeTLB support is enabled only for
SPECpower. In order to provide consistent performance
results throughout our experiments, we configured the
input.load_level.target_max_throughput parameter to achieve
the same performance for each run. It was set to 140,000
ssj_ops for each JVM for a total of 1,120,000 ssj_ops
for the entire run. In all our experiments, 100% load-level
corresponds to 1,120,000 ssj_ops. This value was
determined by averaging 10 calibration runs. We changed
the runtime for each load-level to 120 seconds using the
input.load_level.length_seconds parameter and the pre- and
post-measurement interval to 15 seconds in order to reduce
the total runtime of the benchmark. We use 1000 as our
batch size as there is minimal to no effect on power due
to batch sizes (see Appendix). On average, the SUT
consumes 120 watts at idle and 330 watts at 100% load-level
of SPECpower. We would like to stress that the system
consumes 36.51% of peak power⁶ even when idling.

3.2 Setup for SPECweb 

We used 26 clients, 1 prime client, and 2 BeSims for our
experiments. The prime client is an Intel Xeon E5405
processor with two quad cores and 8 GB of RAM. The
BeSims had two dual-core AMD Opteron 2218 processors
with 4 GB of RAM. In this paper, we benchmark only
the SPECweb_PHP_Ecommerce workload. We used an
Apache installation with the PHP module as our web serving
application. We set up a bonded Ethernet link with the
available ports on the SUT to enable data transmission
up to 2 Gbps. Note that the bonded Ethernet link is only
set up for SPECweb. In our experiments for SPECweb,
100% load-level corresponds to 13000 SUS. This value was
determined using empirical analysis (see Appendix).
In addition to the sessions, all our experiments also
maintain the response time criteria. In our case, 95%
(TIME_GOOD parameter) and 99% (TIME_TOLERABLE
parameter) of the requests need to have response times less
than 3 and 5 seconds, respectively. These response time
constraints are default values and used in the compliant
runs. The load-level is changed by manually modifying
the SIMULTANEOUS_SESSIONS parameter in the input
configuration. We modified the RUN_SECONDS input
parameter to 420 seconds to reduce the runtime of the
benchmark. Since we focus only on the processor package
and memory power management, we load all the data
associated with the Ecommerce workload into RAMFS to
keep the data set in memory and minimize the involvement
of disks. On average, the SUT consumes 120 watts when
idling and 219 watts at 100% load-level of SPECweb. In
the case of SPECweb, the system consumes 54.88% of peak
power when idling.
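The response time criteria used for the SPECweb runs (95% of requests under 3 seconds and 99% under 5 seconds) can be expressed as a simple check. This is a simplified sketch: the function `meets_response_criteria` is hypothetical, it operates on a plain list of response times in seconds, and it does not model any other compliance rule of the benchmark.

```python
def meets_response_criteria(response_times, good=3.0, tolerable=5.0,
                            good_fraction=0.95, tolerable_fraction=0.99):
    """Check TIME_GOOD / TIME_TOLERABLE style criteria on response times (seconds)."""
    if not response_times:
        return False
    n = len(response_times)
    # Count requests that finished strictly faster than each threshold.
    within_good = sum(t < good for t in response_times)
    within_tolerable = sum(t < tolerable for t in response_times)
    return (within_good >= good_fraction * n
            and within_tolerable >= tolerable_fraction * n)
```

A run in which 97 of 100 requests finish in 1 second and the remaining 3 finish in 4 seconds satisfies both criteria, whereas a run with 3 requests at 6 seconds fails the 99% criterion.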

4 An Analysis of Average Power Consumption

In this section, we characterize the power consumption of
the SPECpower and SPECweb benchmarks and analyze
energy proportionality from the perspective of the entire
system as well as each RAPL domain. Through our
experiments, we will show that the most and least energy-proportional
subsystems are the core (PP0) and the uncore
(Package-PP0), respectively.

4.1 Power Consumption Analysis 

As discussed earlier, we are interested in analyzing the
energy proportionality of the system. The deviation of the
power curve of the system from the ideal power curve is of
particular interest to us. To illustrate with an example, we
would like the area between the system and the ideal power
curve to be as small as possible in Figure 1. Henceforth, this
area will be referred to as the energy proportionality gap (EPG).
We are also interested in the linearity of the system power
curve, which is the area between the system power trend
and the linear curve. Henceforth, this area will be referred to
as the linear deviation gap (LDG).

6. Power consumed at 100% load-level.

4.1.1 Properties of Energy-Proportional Systems 

Barroso et al. advocated the design of energy-proportional
systems by addressing power characteristics
of the server and the behavior of enterprise-class server
workloads. They proposed two properties of energy-proportional
systems: low idle power and wide dynamic
power range. These two properties are illustrated
by the ideal curve in Figure 1. The ideal curve
consumes zero power when idling (i.e., at 0% load-level)
and has a wide dynamic power range. In this paper, we
will quantify the idle power and the dynamic power range
as a percentage of peak power (i.e., the power consumed at
100% load-level).

4.1.2 Energy Proportionality Metric 

We quantify the EPG using two different metrics: (1) the
EP metric and (2) the PG metric. Each of these
metrics serves a different purpose and quantifies the energy
proportionality of the system at a different granularity.
The EP metric is calculated as shown below, where
Area_System and Area_Ideal represent the area under the
system and ideal power curves, respectively. A value of 1 for
the metric represents an ideal energy-proportional system.
A value of 0 represents a system that consumes a constant
amount of power irrespective of the load-level. A value
greater than 1 represents a system which is better than
energy-proportional⁷. The EP metric gives a perspective of
the energy proportionality of the system at the full system
level.

\[ EP = 1 - \frac{Area_{System} - Area_{Ideal}}{Area_{Ideal}} \]

7. Originally, the EP metric varied only between 0 and
1 (i.e., it did not account for better than energy-proportional systems).
However, in this paper we extend the EP metric to account for better than
energy-proportional systems (i.e., 0 < EP metric < 2).

The PG metric is calculated as shown below,
where X% represents the X% load-level. The PG
metric defines the EPG at individual load-levels. For an
ideal energy-proportional server, the PG at every load-level
should be 0.

\[ PG_{X\%} = \frac{Power_{System@X\%} - Power_{Ideal@X\%}}{Power_{System@100\%}} \]

The LDG is quantified using the LD metric. The LD
metric is calculated as shown below, where Area_Linear
represents the area under the linear power curve. For a linear
energy-proportional system, the LD metric will be 0. LD metric
values > 0 and < 0 indicate superlinear and sublinear
energy-proportional systems, respectively.

\[ LD = \frac{Area_{System}}{Area_{Linear}} - 1 \]

We will use the properties described in Section 4.1.1
along with the EP metric, PG metric, and LD metric to
quantify the energy proportionality. We will also look at the
EPG and LDG at both the full system- and subsystem-levels
in the rest of the sections.


4.1.3 Methodology

We used the energy meters exposed in each RAPL domain
to determine the power dissipated in each domain. In
all our results, we report the average power⁸ over ten
runs for the domain-level power consumption. For full-system
power measurement, we have followed the power
measurement methodology specified and developed by
the SPEC organization for the SPECpower and SPECweb
benchmarks.

8. Average power is calculated as (final energy reading - initial energy
reading)/time.

4.1.4 Analysis of System- and Subsystem-Level Energy
Proportionality

In this section, we present the details on the power
consumption of SPECpower and SPECweb at a subsystem-level.
We were able to profile the benchmarks at a granularity
that has not been possible until the advent of Intel
Sandy Bridge by using the on-chip energy meters exposed
by the RAPL interfaces. Our results provide insights into
the energy proportionality of a system as a whole as well
as at the RAPL domain-level.

Fig. 2. Analysis of SPECpower Energy Proportionality

Fig. 3. Analysis of SPECweb Energy Proportionality

Figures 2 and 3 describe the energy proportionality
of the full system and different subsystems. The Y-axis
represents the percentage of peak power consumed by
the system or subsystem and the X-axis represents the load-level.
As a result, the ideal curve (green line) consumes
40% of peak power at 40% load-level, 60% of peak power
at 60% load-level, and so on. Figures 2 and 3 are also
a compact comparison of the energy proportionality of
different components of the system. As mentioned earlier,
we will quantify the energy proportionality using the
EP, PG, and LD metrics and the desired properties of an
energy-proportional system.
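To make the EP, PG, and LD metrics of Section 4.1.2 concrete, the following sketch computes all three from a measured power curve. The trapezoidal rule for the areas, the function names, and the convention that load-levels are given as fractions from 0.0 to 1.0 are assumptions made for illustration; the paper does not state how the areas were integrated.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve through the points (xs, ys)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def proportionality_metrics(loads, power):
    """Return (EP, [PG per load-level], LD).

    loads: ascending load-levels from 0.0 to 1.0.
    power: measured power at each load-level (any unit).
    """
    peak = power[-1]
    system = [p / peak for p in power]    # normalize to power at 100% load-level
    ideal = list(loads)                    # ideal curve: power fraction equals load
    idle = system[0]
    linear = [idle + (1.0 - idle) * u for u in loads]  # line from idle to peak
    a_sys = trapezoid_area(loads, system)
    a_ideal = trapezoid_area(loads, ideal)
    a_lin = trapezoid_area(loads, linear)
    ep = 1.0 - (a_sys - a_ideal) / a_ideal
    pg = [s - i for s, i in zip(system, ideal)]  # already divided by peak power
    ld = a_sys / a_lin - 1.0
    return ep, pg, ld
```

As a sanity check on the definitions, an ideal curve yields EP = 1, PG = 0 at every load-level, and LD = 0, while a constant-power curve yields EP = 0 and a PG that grows from 0 at 100% load-level to 1 at 0% load-level.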

4.1.4.1 Full System Energy Proportionality: The
energy proportionality of the full system is represented by
the red line in Figures 2 and 3. The EP metric for the full
system is 0.54 and 0.29 for SPECpower and SPECweb,
respectively. The full system idles at 36.51% and 54.88% of
peak power for SPECpower and SPECweb, respectively. Therefore, it is
impossible to achieve energy-proportional operation for
load-levels less than 36% in the case of SPECpower and 54%
in the case of SPECweb. The dynamic power range⁹ is 63.48%
for SPECpower and 45.11% for SPECweb.

4.1.4.2 Package (PKG) Energy Proportionality: The
EP metric value for the package subsystem is 0.70 and 0.44
for SPECpower and SPECweb, respectively. It is also worth
noting that the power profiles of the package and the full system
follow a similar trend for both benchmarks, indicating a
strong correlation between them¹⁰. The package subsystem
idles at 21.55% and 34.47% of peak power for SPECpower and SPECweb,
respectively. The dynamic power range for SPECpower
is 78.44% and for SPECweb is 65.52%. In general, due to its
better EP metric, lower idle power, and higher dynamic power
range, the package subsystem is more energy-proportional than
the full system.

4.1.4.3 Core (PP0) Energy Proportionality: The purple
line in Figures 2 and 3 describes the energy proportionality
of the PP0 domain. We observe that this subsystem has a near
energy-proportional power profile for the SPECpower benchmark.
However, it is relatively less energy-proportional in
the case of the SPECweb benchmark. It has an EP metric value
of 0.85 in the case of SPECpower and 0.63 in the case of SPECweb.
This subsystem idles at 5.74 watts (4.83% of peak power)
and has a dynamic power range of 95.16% of peak power
for SPECpower. The idle power and dynamic power range
are 8.80% and 91.19% of peak power for the SPECweb
benchmark. The low idle power coupled with the high
dynamic power range makes this subsystem suitable to be
operated at different power-performance trade-offs.

9. Dynamic power range is calculated as power consumed at 100%
load-level - 0% load-level.

10. The Pearson correlation is greater than 0.99.

4.1.4.4 Uncore (Package-PP0) Energy Proportionality:
The uncore subsystem's power consumption remains almost
constant irrespective of the load-level, with an EP metric
value of 0.14 for SPECpower and 0.02 for SPECweb. The
uncore subsystem has the greatest EPG, and as a result,
exhibits the worst power consumption trend among the full
system and RAPL domains from the perspective of energy-proportional
power scaling. It idles at 84.41% and 94.13%
of peak power for SPECpower and SPECweb, respectively.
It has the least dynamic power range among all systems
and subsystems at 15.58% for SPECpower and 5.86% for
SPECweb.

4.1.4.5 Memory (DRAM) Energy Proportionality:
The memory subsystem has an EP metric value of 0.36 for
SPECpower and 0.07 for SPECweb. In the case of SPECweb,
the memory power trend closely follows the uncore power
trend, which makes it less energy-proportional. This worse
memory energy proportionality of the SPECweb benchmark
can be attributed to the usage of RAMFS to house the data
required by the web server. This subsystem idles at 60.80%
and 82.62% of peak power for SPECpower and SPECweb, respectively.

To summarize, Table 2 describes our results on the energy
proportionality analysis of the full system and subsystems.

4.1.5 Analysis of Load-Level Energy Proportionality 

The PG metric allows us to look at the energy proportionality
of a server at each load-level. Figure 4 shows the PG
metric at each load-level for the SPECpower and SPECweb
benchmarks. Similar to the EP metric, the uncore and core
subsystems have the worst and best PG metric for all load-levels
for both benchmarks. The uncore subsystem's
PG increases linearly from 100 to 0 percent load-level,
which again shows that the subsystem's power remains
constant irrespective of the load-level. In the case of the
PP0 subsystem, there is an increase in the PG metric when
load-level increases from 0 to 10 percent. This trend shows
that the energy proportionality gap at 0% load is better
than at low utilization levels for both benchmarks. The
proportionality gap becomes better than at 0% load-level only
at 70% and 80% load-level for the SPECpower and SPECweb
benchmarks, respectively, for the PP0 subsystem. Such
trends can be seen for the package subsystem and full system
as well.

In summary, the core is the most energy-proportional and the
uncore is the least energy-proportional subsystem.

4.1.6 Analysis of Linear Deviation 

Table 3 shows the LD metric for both benchmarks at each
RAPL domain and the full system. The LD metric for all
subsystems is always positive as none of them has a
sub-linear energy proportionality trend. This observation
also provides evidence that there is opportunity to improve
the energy proportionality by improving (i.e., decreasing) the
LD metric.
TABLE 2

Summary of Full System- and Subsystem-Level Energy Proportionality Analysis. Note: Idle Power and Dynamic Power
Range are Represented as Percentage of Peak Power.

Subsystem               Benchmark    EP Metric    Idle Power    Dynamic Power Range
Full System             SPECpower    0.54         36.51         63.48
                        SPECweb      0.29         54.88         45.11
Package (PKG)           SPECpower    0.70         21.55         78.44
                        SPECweb      0.44         34.47         65.52
Core (PP0)              SPECpower    0.85         4.83          95.16
                        SPECweb      0.63         8.80          91.19
Uncore (Package-PP0)    SPECpower    0.14         84.41         15.58
                        SPECweb      0.02         94.13         5.86
Memory (DRAM)           SPECpower    0.36         60.80         39.19
                        SPECweb      0.07         82.62         17.37


Fig. 4. Analysis of Load-Level Energy Proportionality
(panels: SPECpower - PG Comparison; SPECweb - PG Comparison)


TABLE 3

Summary of Full System- and Subsystem-Level Linear
Deviation Analysis.

Subsystem              Benchmark   LD Metric
Full System            SPECpower   0.067
                       SPECweb     0.101
Package (PKG)          SPECpower   0.066
                       SPECweb     0.151
Core (PP0)             SPECpower   0.095
                       SPECweb     0.254
Uncore (Package-PP0)   SPECpower   0.004
                       SPECweb     0.019
Memory (DRAM)          SPECpower   0.013
                       SPECweb     0.053


5 An Analysis of Instantaneous Power 
Consumption 

Here we present our results for the instantaneous power
profile analysis of the SPECpower and SPECweb benchmarks.
Our main goal is to visualize the opportunities for
power limiting. We collected the instantaneous power profile
for five load-levels.

5.1 Methodology 

Our results are shown as cumulative distribution functions
(CDFs). The CDFs present the percentage of time spent at
or below a given percentage of the maximum power limit
possible. We refer the reader to Table 1 for the maximum
power limit possible for each subsystem. We collect the
instantaneous power profile of the package and memory
subsystems at 50 ms resolution. The results are normalized
to their respective maximum power limit possible.
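
One way to obtain such a profile is to poll a cumulative energy counter at a fixed interval and difference consecutive readings. The sketch below assumes that approach, with readings in microjoules such as those exposed by the Linux powercap sysfs files `energy_uj` and `max_energy_range_uj`. It is not necessarily how the measurements in this paper were taken; the function names and sample readings are hypothetical.

```python
def power_from_energy_samples(energy_uj, interval_s=0.05, max_range_uj=None):
    """Convert cumulative energy readings (microjoules) into watts per interval.

    energy_uj: readings taken every `interval_s` seconds
    max_range_uj: counter range, used to undo a wraparound between two readings
    """
    watts = []
    for previous, current in zip(energy_uj, energy_uj[1:]):
        delta = current - previous
        if delta < 0:  # the counter wrapped between the two readings
            if max_range_uj is None:
                raise ValueError("counter wrapped but max_range_uj is unknown")
            delta += max_range_uj
        watts.append(delta / 1e6 / interval_s)
    return watts


def normalize(watts, max_power_limit_w):
    """Express each sample as a fraction of the maximum power limit possible."""
    return [w / max_power_limit_w for w in watts]


# Hypothetical readings taken 50 ms apart: 1 J and then 2 J consumed.
samples = power_from_energy_samples([0, 1_000_000, 3_000_000])
print(normalize(samples, max_power_limit_w=80.0))
```

Keeping the conversion separate from the code that reads the counter makes the arithmetic easy to check against recorded traces.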

5.2 Instantaneous Power Analysis for Package 
(PKG) Subsystem 


Fig. 5. Analysis of SPECpower Instantaneous Power
Consumption For Package (PKG) Subsystem
(plot: SPECpower Package Power Consumption Distribution;
x-axis: Normalized Package Power (Actual Power/Peak Power Limit Possible))

Fig. 6. Analysis of SPECweb Instantaneous Power
Consumption For Package (PKG) Subsystem
(plot: SPECweb Package Power Consumption Distribution;
x-axis: Normalized Package Power (Actual Power/Peak Power Limit Possible))

Figures 5 and 6 show the instantaneous power consumption
of the package subsystem for five different load-levels
of SPECpower and SPECweb. We observe that
the maximum power consumed at the 100% load-level of
SPECpower, as indicated by its CDF, is lower than 0.5
normalized power. This indicates that the maximum power
consumed while executing SPECpower is less than 50%
of the maximum power limit possible. The upper limit for
package power consumption for the SPECweb benchmark
is also less than 50% of the maximum power limit possible.
The lowest point in the CDF of each workload corresponds
to the minimum power consumed: at no point during
the execution of that load-level does the subsystem consume
less power. For example, the 100% and 60% load-levels of
SPECpower do not consume less than 40% and 20% of
normalized power, respectively. The shape of the curves
indicates that each load-level spends most of the time
consuming a narrow range of power. For instance, in the case
of SPECpower, the 80% load-level spends most of the time
consuming between 34% and 46% of the maximum
power limit possible. For both benchmarks, we would
benefit from removing the relatively few intervals (indicated
by the flat lines at 100%) where the workload has a power
spike. Power limiting can help remove
these few intervals. We also observe that the power range
decreases as the load-level increases. In the case of SPECpower,
the 100% load-level has a power range from 0.40 to 0.50 of the
normalized power, whereas the 20% load-level has a power
range from 0.10 to 0.32 of the normalized power.
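
The CDF readings used in this analysis can be sketched as follows. This is a minimal illustration, assuming the profile is available as a list of normalized power samples. The sample values and the candidate cap are hypothetical, chosen only to resemble a load-level that spends most of its time in a narrow range with one brief spike.

```python
def fraction_at_or_below(samples, threshold):
    """Empirical CDF value: share of intervals with power <= threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


def spike_fraction(samples, cap):
    """Share of intervals that exceed a candidate cap (the tail of the CDF)."""
    return 1.0 - fraction_at_or_below(samples, cap)


# Hypothetical normalized package power samples for one load-level.
profile = [0.34, 0.36, 0.38, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.49]

print(min(profile), max(profile))           # power range of the load-level
print(fraction_at_or_below(profile, 0.46))  # time spent at or below 0.46
print(spike_fraction(profile, 0.46))        # intervals a cap at 0.46 would affect
```

A small spike fraction at a given cap suggests that the cap would touch only a few intervals, which is the kind of opportunity this section looks for.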

5.3 Instantaneous Power Analysis for Memory 
(DRAM) Subsystem 

Figures 7 and 8 describe the instantaneous power consumption
of the memory subsystem for five different load-levels of
SPECpower and SPECweb, respectively. We observe CDF
curves similar to those of the package subsystem for both
benchmarks. The minimum normalized power consumed at each
load-level is higher than the corresponding observation
for the package subsystem. This is expected behavior, as the
memory subsystem idles at a higher percentage of peak
power than the package subsystem (see Table 2). Similar
to the package subsystem, each load-level spends most of
the time consuming a narrow range of power. The 100%
load-level for SPECpower consumes 87 to 93 percent of the
peak power limit possible, leaving less opportunity for
memory power management than at other load-levels. The
range of memory power consumption for the SPECweb benchmark
is narrower than for SPECpower, as all load-levels of
SPECweb consume between 55 and 80 percent of
the peak power limit possible. In general, there is less
opportunity to limit the power consumption of the memory
subsystem than of the package subsystem.

Fig. 7. Analysis of SPECpower Instantaneous Power
Consumption For Memory (DRAM) Subsystem
(plot: SPECpower Memory Power Consumption Distribution;
x-axis: Normalized Memory Power (Actual Power/Peak Power Limit Possible))

Fig. 8. Analysis of SPECweb Instantaneous Power
Consumption For Memory (DRAM) Subsystem
(plot: SPECweb Memory Power Consumption Distribution;
x-axis: Normalized Memory Power (Actual Power/Peak Power Limit Possible))

In summary, there is opportunity to limit the power consumption
of SPECpower and SPECweb at different load-levels below
the 50-ms resolution for the package and DRAM subsystems.


6 Efficacy of Power Limiting 

In this section, we discuss the effects of power limiting on
the performance and power of the SPECpower and SPECweb
benchmarks. Specifically, we investigate whether we can
achieve energy-proportional operation for these benchmarks
by leveraging the RAPL interfaces. Through our
experiments, we show that most of the power savings
come from the PP0 domain and that power limiting of the
memory subsystem contributes the least to the power savings.
Fig. 9. Impact of Power Limiting on SPECpower
(panels: SPECpower System Power Consumption;
SPECpower Processor Package (PKG) Power Consumption;
SPECpower Processor Core (PP0) Power Consumption;
SPECpower Memory (DRAM) Power Consumption)

6.1 Methodology 

We run both the SPECpower and SPECweb benchmarks
at five different load-levels (from 20% to 100% in steps
of 20%) under a power limit. Our experiments focus on PP0
and DRAM power limiting. We do not focus on processor
package power limiting, as the uncore subsystem does
not contribute to power savings at any load-level and all
the power savings came from the PP0 domain when we
experimented with processor package power limiting [22].
Our experiments present results for three different
power-limiting scenarios:

• CPUOnly policy: Performance under only a PP0 subsystem
power limit.

• MemOnly policy: Performance under only a DRAM
subsystem power limit.

• CPU+Mem policy: Performance under PP0 and DRAM
subsystem power limits.

In our experiments, we manually configure the power
limit using the RAPL interfaces. For the CPUOnly and
MemOnly policies, we manually set 15 different power
limits at or below the average power consumption of the
corresponding subsystem. These 15 power limits range
from the average power consumption to 28 watts less
than the average power consumption, in steps of 2 watts.
For the CPU+Mem policy, we look at all possible power
limits, for a total of 225 combinations at each load-level. In
this paper, we only present the best possible power savings
without performance degradation for the benchmarks. We
only present runs which achieve performance within 1%
of the target load-level for SPECpower. In the case of SPECweb,
we present results which achieve the target load-level and
maintain the TIME_GOOD and TIME_TOLERABLE constraints
(described earlier in the paper). We also use the least possible
value as the time window for power limiting (i.e., 976 microseconds).
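
The sweep described above can be sketched as follows. This is an illustrative outline only: the function names, the record layout, and the sample runs are hypothetical, and the selection rule shown is the SPECpower criterion (performance within 1% of the target). SPECweb would instead need a check of its response-time constraints.

```python
from itertools import product


def candidate_limits(average_power_w, count=15, step_w=2.0):
    """Power limits from the average power down to 28 W below it, in 2 W steps."""
    return [average_power_w - step_w * i for i in range(count)]


def best_run(runs, target_perf, tolerance=0.01):
    """Pick the lowest-power run whose performance is within the tolerance."""
    acceptable = [
        run for run in runs
        if abs(run["perf"] - target_perf) <= tolerance * target_perf
    ]
    if not acceptable:
        return None
    return min(acceptable, key=lambda run: run["power_w"])


# Hypothetical average power values for the PP0 and DRAM subsystems.
cpu_limits = candidate_limits(50.0)
mem_limits = candidate_limits(40.0)
combos = list(product(cpu_limits, mem_limits))  # search space of the CPU+Mem policy
print(len(cpu_limits), len(combos))

# Hypothetical measured outcomes for three of the limits.
runs = [
    {"limit_w": 50.0, "perf": 1000.0, "power_w": 100.0},
    {"limit_w": 40.0, "perf": 995.0, "power_w": 90.0},
    {"limit_w": 30.0, "perf": 950.0, "power_w": 80.0},
]
print(best_run(runs, target_perf=1000.0))
```

Enumerating the limits up front makes it clear why the combined policy requires 15 x 15 = 225 runs per load-level.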

6.2 Impact of Power Limiting 

Figure 9 shows the normalized power consumption at
five different load-levels of SPECpower. The values are
normalized against the power consumption of the vanilla run
at the 100% load-level. We show the power consumption for the
full system (top left) and the processor package (top right),
PP0 (bottom left) and DRAM (bottom right) subsystems.
Such a representation of the power consumption allows us to
identify whether we achieve energy proportionality at a
particular load-level.

We observe that we achieve energy proportionality
for the full system only at the 100% and 80% load-levels.
However, power limiting reduces the power consumption
at the other load-levels even though we are not able to
achieve energy-proportional operation. We would like to
emphasize that the system consumes 36.51% of peak power
even when idling. The CPU+Mem and CPUOnly policies
achieve the lowest power consumption. The MemOnly policy
achieves a negligible reduction in power consumption. In the
case of processor package power consumption, we are able
to achieve energy-proportional operation at all load-levels
except the 20% load-level. Moreover, the reduction in
power consumption is larger than for the full
system. This is an expected outcome, as we only have
power-limiting control over the processor package, PP0 (which
is a part of the processor package) and DRAM subsystems.
Through power limiting, we achieve energy-proportional
operation at all load-levels when we look at the PP0 domain
in isolation. As mentioned earlier, we achieve negligible
power reduction from the DRAM subsystem while meeting
the performance constraints of the benchmarks. For
the subsystems, the different power-limiting policies
have the same effect as seen for the full system (i.e., using
CPU+Mem and CPUOnly results in the best possible power
reduction, whereas using MemOnly results in negligible
power savings).

Fig. 10. Impact of Power Limiting on SPECweb
(panel: SPECweb System Power Consumption)

Fig. 11. PG Metric after Power Cap
(panels: SPECpower - PG Comparison After Power Cap;
SPECweb - PG Comparison After Power Cap)

Figure 10 shows the normalized power consumption at
five different load-levels of the SPECweb benchmark. Similar
to SPECpower, we show the power consumption for the
full system (top left) and the processor package (top right),
PP0 (bottom left) and DRAM (bottom right) subsystems.
We achieve energy-proportional operation only at the 100%
load-level in the case of SPECweb. Similar to SPECpower,
however, we achieve power savings at the other load-levels.
The power reduction for SPECweb is less than the power
reduction seen for SPECpower, as SPECpower is more
energy-proportional than SPECweb (see Section 4). We
would also like to stress that the SPECweb benchmark
idles at 54.88% of its peak power. When looking at the
processor package power consumption in isolation, we
are able to achieve energy-proportional operation at the
100% and 80% load-levels. The PP0 domain provides the
highest power reduction and achieves energy-proportional
operation at all load-levels except the 20% load-level. The
memory subsystem does not contribute much to the power
reduction. The CPU+Mem and CPUOnly policies provide the
best power reduction possible.

Table 4 shows the EP metric for the configuration
which achieves the best power savings for the SPECpower and
SPECweb benchmarks. We see components with an EP metric
> 1, indicating that we are operating at better than
energy-proportional trade-off points. As expected, the EP metric
for the PP0 subsystem has seen a substantial increase. For
the PP0 domain, the metric increases from 0.85 to 1.18 and
from 0.63 to 1.12 for the SPECpower and SPECweb benchmarks,
respectively. The memory and uncore subsystems do not
see any significant improvement in the EP metric.

Fig. 12. Power Savings
(panels: SPECpower Power Savings; SPECweb Power Savings)

Fig. 13. Energy Efficiency
(panels: SPECpower Energy Efficiency; SPECweb Energy Efficiency)

TABLE 4

Summary of Full System- and Subsystem-Level Energy
Proportionality and Linear Deviation After Power Caps.

Subsystem              Benchmark   EP Metric   LD Metric
Full System            SPECpower   0.69        -0.044
                       SPECweb     0.48        -0.024
Package (PKG)          SPECpower   0.96        -0.1490
                       SPECweb     0.79        -0.101
Core (PP0)             SPECpower   1.18        -0.221
                       SPECweb     1.12        -0.192
Uncore (Package-PP0)   SPECpower   0.16         0.004
                       SPECweb     0.03         0.019
Memory (DRAM)          SPECpower   0.37         0.013
                       SPECweb     0.09         0.045

Figure 11 shows the PG metric for the configuration
which achieves the best power savings. For the Package and
PP0 subsystems, the PG metric is negative at some
load-levels, suggesting that we achieve better than
energy-proportional operation. As observed, both the memory and
uncore subsystems are not amenable to operating at different
power-performance trade-off points, as the PG metric trend
decreases linearly for those subsystems.

Table 4 also shows the LD metric for the best power
savings run. We are able to shift the linear deviation
from positive to negative for the full system, Package
and PP0 subsystems. Our approach improves the energy
proportionality of the server by improving the linear
deviation of the subsystems.

6.3 Power Savings 

Figure 12 shows the power savings for the SPECpower
(left) and SPECweb (right) benchmarks. Power limiting
conserves between 3% and 15% of power at the full
system-level. We would like to stress that the subsystems over
which we do not have power-limiting control consume
between 11% and 17% of the power of the full system, depending
upon the load-level. The power savings at the 100%
load-level for SPECpower are less than for SPECweb, as the former
is a CPU-intensive benchmark and most of our power
savings come from the PP0 domain. We observe that the
memory subsystem provides negligible power savings.
In the case of the PP0 domain, we conserve between 3% and 30%
for SPECpower and between 14% and 45% for SPECweb, depending
upon the load-level.

6.4 Impact on Energy Efficiency 

Figure 13 shows the energy efficiency of SPECpower and
SPECweb at five different load-levels for the CPU+Mem
and vanilla runs. The energy efficiency of SPECpower and
SPECweb is represented as ssj_ops/watt and SUS/watt,
respectively. The improvement is calculated as the ratio of
the difference between the energy efficiency under the power
limit and that of the vanilla run over the energy efficiency of
the vanilla run. We achieve energy efficiency improvements
in all cases. SPECpower and
SPECweb achieve up to 16 and 17 percent energy efficiency
improvement, respectively, due to power limiting.
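
The improvement calculation described above amounts to a relative difference. A minimal sketch follows, with hypothetical performance and power numbers and hypothetical function names; energy efficiency is taken as performance per watt, as in ssj_ops/watt or SUS/watt.

```python
def energy_efficiency(performance, average_power_w):
    """Performance per watt (e.g., ssj_ops/watt or SUS/watt)."""
    return performance / average_power_w


def improvement(ee_limited, ee_vanilla):
    """Relative improvement of the power-limited run over the vanilla run."""
    return (ee_limited - ee_vanilla) / ee_vanilla


# Hypothetical runs with identical performance but lower power under the limit.
ee_vanilla = energy_efficiency(performance=100000.0, average_power_w=200.0)
ee_limited = energy_efficiency(performance=100000.0, average_power_w=175.0)
print(improvement(ee_limited, ee_vanilla))
```

When performance is unchanged, the improvement depends only on the ratio of the two power values, so any power saved shows up directly as an efficiency gain.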
Fig. 14. Instantaneous SPECpower Package Power Consumption
(After and Before Applying Power Caps)

Fig. 15. Instantaneous SPECweb Package Power Consumption
(After and Before Applying Power Caps)

Fig. 16. Instantaneous SPECpower Memory Power Consumption
(After and Before Applying Power Caps)

Fig. 17. Instantaneous SPECweb Memory Power Consumption
(After and Before Applying Power Caps)


6.5 Impact on Instantaneous Power Consumption 

Over-provisioning leads to the wasting of infrastructure
resources, and the maximum instantaneous power consumed
by the subsystems is an important factor in determining
the power budget for a system. Determining the optimal
power provisioning strategy requires an understanding of
the instantaneous power profile of the system. To this
end, the instantaneous power profile is discussed in this
section. We describe the difference in the instantaneous power
profile at two different load-levels (40 and 60 percent)
with and without a power cap for both the SPECpower and
SPECweb benchmarks. The instantaneous power profile
for the configuration which achieved the best power savings
is shown. The power profile of the package and memory
subsystems is collected at 50 ms resolution in all cases.

Figures 14 and 15 show the instantaneous power profile
of the package subsystem at the 40 and 60 percent load-levels
with and without the power cap for SPECpower and
SPECweb, respectively. Power limiting works as expected
for both the SPECpower and SPECweb benchmarks. The range
of instantaneous power consumption is narrowed due to
power capping. Moreover, power limiting removes the
relatively few power spikes indicated by the flat lines at
100% (see Section 5.2). Such power limiting mechanisms
are useful for power provisioning without impacting the
performance of the application.

Figures 16 and 17 show the instantaneous power profile
of the memory subsystem at the 40 and 60 percent load-levels
with and without the power cap for both benchmarks.
The relatively few power spikes in the memory subsystem
for both benchmarks are removed due to power
limiting. Even though we do not achieve considerable power
savings at the memory subsystem-level due to power
limiting, applying appropriate power limits such that the
impact on performance is controlled at a desirable level can
help make power provisioning decisions and increase the
efficiency of the server.


7 Related Work 

7.1 Energy-Proportional Operation For Enterprise 
Class Workloads 

Wong et al. [25] provide an architecture for improving
energy proportionality using server-level heterogeneity.
They combine a high-power compute node with a low-power
processor, essentially creating two different
power-performance operating regions. They save power by
redirecting requests to the low-power processor at low
request rates, thereby improving energy proportionality.
Our work looks at improving the energy proportionality
of traditional servers by improving the subsystem-level
energy proportionality using the RAPL interfaces.

Meisner et al. [13] characterize online data-intensive
(OLDI) services to identify opportunities for power
management, design a framework that predicts the performance
of OLDI workloads, and investigate the power and
performance trade-offs using their simulation framework. Fan
et al. [6] investigate the benefits of energy-proportional
systems in improving the efficiency of power provisioning
using their power models. They provide evidence that
energy-proportional systems will enable improved power
capping at the data-center level. In contrast, we look at
leveraging the power-capping mechanism to achieve
energy-proportional operation for SPECpower and SPECweb.

Tolia et al. [23] proposed that by migrating workloads
from under-utilized systems to other systems and turning
the under-utilized systems off, energy proportionality can
be approximated at an ensemble-level (i.e., for a group
of nodes or at the rack-level). They used virtual machine (VM)
migration as a software mechanism to move workloads
off of under-utilized systems. In this paper, we use
user-defined and hardware-enforced power limiting to achieve
energy-proportional systems at the node-level.

7.2 Subsystem-level Power Management 

Deng et al. [4] propose the CoScale framework, which
dynamically adapts the frequency of the CPU and memory
while respecting a certain application performance degradation
target. They also take per-core frequency settings into
account [5]. Li et al. [11] study CPU microarchitectural
adaptation and memory low-power states to reduce the energy
consumption of applications, bounding the performance
loss by using a slack allocation algorithm. Our paper deals
with subsystem-level power management on a real system.

7.3 Power Limiting 

Several mechanisms to cap the power consumption of the
system have been studied [2], [7]. In this paper, however,
we study the use of RAPL power limiting, which is
hardware-enforced. David et al. [3] proposed RAPL and
evaluated it for the memory subsystem. They present
a model that accurately predicts the power consumed by
the DIMMs and use RAPL to cap the power consumption.
Rountree et al. [14] use RAPL power limiting to study
the performance behavior of benchmarks in the NAS
parallel benchmark suite. Specifically, they are interested
in the performance of various compute nodes under a
power bound. Weaver et al. [24] have exposed the RAPL
energy meters through PAPI. We use the RAPL interfaces to
achieve energy-proportional operation for the SPECpower and
SPECweb benchmarks and, to the best of our knowledge,
there is no previous study on using the RAPL interfaces for
enterprise-class server workloads.

8 Conclusion 

The management of power and energy is a key issue for
data centers. Efficient power management of
enterprise-class server workloads has the potential to greatly
reduce energy-related costs and facilitate efficient power
provisioning.


Energy proportionality holds the potential to significantly
improve the energy efficiency of data centers.
Consequently, in this paper, we investigate the potential
of achieving energy proportionality for the SPECpower and
SPECweb benchmarks using the RAPL interfaces. Our study
sheds light on the mechanisms for power management
of enterprise-class server workloads and the efficacy of
the RAPL interfaces. We identify the least and most
energy-proportional subsystems using the on-chip energy meters.
We then characterize the instantaneous power profile of
these benchmarks to identify whether there is any opportunity
to limit the power consumption of these benchmarks.
Finally, we present our results on the impact of power
limiting on the power, performance and energy efficiency
of the SPECpower and SPECweb benchmarks. Our results
show that we are able to achieve power savings of up to
15%.

Appendix A 

SPECpower Batch Size has No Effect on 
Power Consumption 
Fig. 18. Average Power Consumption of SPECpower

Figure 18 shows the average power consumption of the
SPECpower benchmark at different load-levels. For
full-system power measurement, we have followed the power
measurement methodology specified and developed by the
SPEC organization for the SPECpower benchmark [19]. The
figure also shows the effect of changing the batch sizes in
SPECpower. We wanted to quantify this effect, as batching
queries to exploit and create opportunities for power
management is a well-researched area. The number
of transactions in each batch scheduled in the SPECpower
benchmark is calibrated using the input.scheduler.batch_size
input parameter. We use eight different batch sizes
from 1000 to 8000 in steps of 1000. Each data point presented
is the average of 10 runs. The standard deviation of the
power consumed during the runs was less than ±2% of
the mean.
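
The repeatability check described above can be sketched as follows. The ten power readings are hypothetical, the function name is illustrative, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
import statistics


def relative_std(samples):
    """Standard deviation of the samples expressed as a fraction of their mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical average power (watts) from 10 runs at one load-level and batch size.
runs_w = [151.2, 149.8, 150.5, 150.1, 149.5, 150.9, 150.3, 149.9, 150.6, 150.2]

print(statistics.mean(runs_w))
print(relative_std(runs_w) < 0.02)  # within the ±2% bound quoted above
```

Applying the same check to each batch size and load-level gives a quick way to confirm that differences between the plotted lines are within run-to-run noise.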

Our results shed light on the repeatability of our
experiments and the consistency of the SPECpower benchmark.
We observe that the lines in the plot overlap each other.
Based on our experiments, the batch sizes have minimal
to no effect on the power consumed by the benchmark.
Similarly, changing the batch sizes did not have any effect on
the power consumption of the subsystems (i.e., package, core
and memory).


Appendix B 

SPECweb at 13000 SUS is Network-Intensive


Fig. 19. CDF of SPECweb Network Bandwidth
(plot: SPECweb Network Transmitted Bandwidth Distribution)

Figure 19 shows the cumulative distribution function
(CDF) of the transmitted network bandwidth while
running SPECweb at 13000 SUS. This CDF presents the
percentage of total time during which the transmitted bandwidth
was at or below a certain percentage of the peak
bandwidth possible. In our case, the peak bandwidth is 256
megabytes per second (MBPS) due to the bonded Ethernet
connection on the testbed (see the testbed description). We
monitor the network bandwidth using the sar utility at a
resolution of one second. We observe that the SPECweb
benchmark at 13000 SUS is network-intensive. The benchmark spends
80% of the time consuming more than 80% of the network
bandwidth. Moreover, it spends 50% of the time consuming
more than 85% of the network bandwidth. Through our
experiments, we also found that the system under test was
not able to meet the response time constraints when we
increased the SUS beyond 13000. Hence, our experiments
use 13000 SUS as the 100% load-level for SPECweb.



References 

[1] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional
Computing. IEEE Computer, 40(12):33-37, 2007.

[2] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda. Pack & Cap:
Adaptive DVFS and Thread Packing Under Power Caps. In Proc. of
the Int'l Symp. on Microarchitecture, MICRO-44, pages 175-185, 2011.

[3] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le.
RAPL: Memory Power Estimation and Capping. In Proceedings
of the International Symposium on Low Power Electronics and Design,
ISLPED, pages 189-194, 2010.

[4] Q. Deng, D. Meisner, A. Bhattacharjee, T. F. Wenisch, and R. Bianchini.
CoScale: Coordinating CPU and Memory System DVFS in
Server Systems. In Proceedings of the International Symposium on
Microarchitecture, MICRO, 2012.

[5] Q. Deng, D. Meisner, A. Bhattacharjee, T. F. Wenisch, and R. Bianchini.
MultiScale: Memory System DVFS with Multiple Memory Controllers.
In Proceedings of the International Symposium on Low Power Electronics
and Design, ISLPED, 2012.

[6] X. Fan, W.-D. Weber, and L. A. Barroso. Power Provisioning For
a Warehouse-Sized Computer. In Proceedings of the International
Symposium on Computer Architecture, ISCA, 2007.

[7] A. Gandhi, M. Harchol-Balter, R. Das, J. Kephart, and C. Lefurgy.
Power Capping Via Forced Idleness. In Proceedings of Workshop on
Energy-Efficient Design, WEED, 2009.

[8] C. Hsu and W. Feng. A Power-Aware Run-Time System for High-
Performance Computing. In Proceedings of the SC Conference, 2005.

[9] Intel. Intel Xeon Processor E5-2600 Product Family
Uncore Performance Monitoring Guide, 2012. Available at
http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf

[10] Intel. Intel 64 and IA-32 Software Developer Manuals - Volume 3,
2013. Available at
www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

[11] X. Li, R. Gupta, S. V. Adve, and Y. Zhou. Cross-Component Energy
Management: Joint Adaptation of Processor and Memory. ACM
Transactions on Architecture and Code Optimization, 2007.

[12] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating
Server Idle Power. In Proceedings of International Conference on
Architectural Support for Programming Languages and Operating Systems,
ASPLOS, 2009.

[13] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F.
Wenisch. Power Management of Online Data-Intensive Services. In
Proceedings of the International Symposium on Computer Architecture,
ISCA, pages 319-330, 2011.

[14] B. Rountree, D. Ahn, B. de Supinski, D. Lowenthal, and M. Schulz.
Beyond DVFS: A First Look at Performance Under a Hardware-
Enforced Power Bound. In Proc. of the Int'l Parallel and Distributed
Processing Symp. Workshops and PhD Forum, IPDPSW, pages 947-953,
2012.

[15] B. Rountree, D. K. Lowenthal, B. R. de Supinski, M. Schulz, V. W.
Freeh, and T. Bletsch. Adagio: Making DVS Practical For Complex
HPC Applications. In Proc. of the Int'l Conf. on Supercomputing, ICS,
pages 460-469, 2009.

[16] F. Ryckbosch, S. Polfliet, and L. Eeckhout. Trends in Server Energy
Proportionality. IEEE Computer, 44(9):69-72, 2011.

[17] D. C. Snowdon, S. M. Petters, and G. Heiser. Accurate On-line
Prediction of Processor and Memory Energy Usage Under Voltage
Scaling. In Proceedings of the International Conference on Embedded
Software, EMSOFT, pages 84-93, 2007.

[18] SPEC. SPECpower Benchmark, 2008. Available at
http://www.spec.org/power_ssj2008

[19] SPEC. SPECpower Benchmark - Benchmarking Methodology, 2008.
Available at
http://www.spec.org/power/docs/SPEC-Power_and_Performance_Methodology.pdf

[20] SPEC. SPECpower Benchmark - Run Rules, 2008. Available at
http://www.spec.org/power/docs/SPECpower_ssj2008-Run_Reporting_Rules.html

[21] SPEC. SPECweb2009 Benchmark - User Guide, 2009. Available at
http://www.spec.org/web2009/docs/usersguide.html

[22] B. Subramaniam and W. Feng. Towards Energy-Proportional
Computing for Enterprise-Class Server Workloads. In Proceedings of
the International Conference on Performance Engineering, ICPE, 2013.

[23] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu.
Delivering Energy Proportionality with Non Energy-Proportional
Systems: Optimizing the Ensemble. In Proceedings of the Conference On
Power Aware Computing and Systems, HotPower. USENIX Association,
2008.

[24] V. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek,
D. Terpstra, and S. Moore. Measuring Energy and Power with PAPI.
In Int'l Workshop on Power-Aware Systems and Architectures, 2012.

[25] D. Wong and M. Annavaram. KnightShift: Scaling the Energy
Proportionality Wall Through Server-Level Heterogeneity. In Proceedings
of the International Symposium on Microarchitecture, MICRO, 2012.

Balaji Subramaniam is a Ph.D. student in the Computer Science department
at Virginia Tech. His research interests include energy-proportional
computing, power modeling and prediction, hardware- and software-
controlled power management, and benchmarking.

Wu-chun Feng is Professor and Elizabeth and James E. Turner Fellow
of Computer Science at Virginia Tech. He received his Ph.D. in Computer
Science from the University of Illinois at Urbana-Champaign in 1996.


































